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Abstract — We revisit the problem of computing submatrices 
of the Cramer-Rao bound (CRB), which lower bounds the 
variance of any unbiased estimator of a vector parameter 0. 
We explore iterative methods that avoid direct inversion of 
the Fisher information matrix, which can be computationally 
expensive when the dimension of is large. The computation of 
the bound is related to the quadratic matrix program, where 
there are highly efficient methods for solving it. We present 
several methods, and show that algorithms in prior work are 
special instances of existing optimization algorithms. Some of 
these methods converge to the bound monotonically, but in 
particular, algorithms converging non-monotonically are much 
faster. We then extend the work to encompass the computation 
of the CRB when the Fisher information matrix is singular and 
when the parameter is subject to constraints. As an application, 
we consider the design of a data streaming algorithm for network 
measurement. 

Index Terms — Cramer-Rao bound. Fisher information, matrix 
functions, optimization, quadratic matrix program. 

I. Introduction 

The Cramer-Rao bound (CRB) ifTTl is important in quan- 
tifying the best achievable covariance bound on unbiased 
parameter estimation of n parameters 6. Under mild regu- 
larity conditions, the CRB is asymptotically achievable by 
the maximum likelihood estimator. The computation of the 
CRB is motivated by its importance in various engineering 
disciplines: medical imaging |6|, blind system identification 
II20I . and many others. 

A related quantity is the Fisher information matrix (FIM), 
whose inverse is the CRB. Unfortunately, direct inversion 
techniques are known for their high complexity in space 
(O(n^) bytes of storage) and time (0{n^) floating point 
operations or flops). Often, one is just interested in a portion 
of the covariance matrix. In medical imaging applications, 
for example, only a small region is of importance, which is 
related to the location of a tumor or lesion. In this instance, 
computing the full inverse of the FIM becomes especially 
tedious and intractable when the number of parameters is 
large. In some applications, the FIM itself is singular, and the 
resulting Moore-Penrose pseudoinverse computation is even 
more computationally demanding. Avoiding the additional 
overhead incurred from direct inversion or other forms of 
matrix decompositions (Cholesky, QR, LU decompositions, 
for example) becomes a strong motivation. 

Prior work Q, |I8] proves the tremendous savings in 
memory and computation by presenting several recursive 
algorithms computing only submatrices of the CRB. Hero 
and Fessler |7| developed algorithms based on matrix split- 
ting techniques, and statistical insight from the Expectation- 
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Maximization (EM) algorithm. Only O(n^) flops are required 
per iteration, which is advantageous if convergence is rapid, 
and the algorithms produce successive approximations that 
converge monotonically to the CRB. Exponential convergence 
was reported, resulting in computational savings, with the 
asymptotic rate of convergence governed by the relationship 
between the FIM of the observation space and the com- 
plete data space. This seminal work was further extended in 
[8j, where better choices of preconditioning matrices led to 
much faster convergence rates. Furthermore, if the requirement 
of monotonic convergence of the iterates to the bound is 
dispensed with, there exists several algorithms with even 
faster convergence rates. The work also presents a way of 
approximating the inverse of singular FIMs. 

In this paper, we show that the algorithms proposed in prior 
work are special instances of a more general framework related 
to solving a quadratic matrix program UJ, a generalization 
of the well-known quadratic program, a convex optimization 
problem fS]. The reformulation provides a framework to 
develop methods for fast computation of the CRB, and explore 
various computational trade-offs. Consequently, the vast litera- 
ture in convex optimization can be exploited. Our formulation 
enables us to extend to the cases when the parameters are 
constrained 0, ifTSl and when the Fisher information matrix is 
singular, with ease. The work done here may be of independent 
interest to other areas when a similar motivation is required. 
We then apply these methods on an application related to the 
design of a specific data streaming algorithm for measuring 
flows through a router By doing so, we are able to compare 
the performance of several constrained optimization methods. 

We denote all vectors and matrices with lower case and 
upper case bold letters respectively. Random variables are 
italicized upper case letters. Sets are denoted with upper case 
calligraphic font. We work entirely in the real space R. 
and §" denote the set of real-valued, symmetric positive 
definite and positive semidefinite matrices of size n. The 
matrix diag(x) is a diagonal matrix with elements of x on its 
diagonals. tr(A) and rank(A) denote the trace and rank of a 
matrix A respectively. The eigenvalues of A are denoted by 
-^i(A) > A2(A) > . . . > A„(A), ordered from maximum to 
minimum. Vector denotes the i-th canonical Euclidean basis 
in R". ||x||2 and ||X||^- denotes the Euclidean and Frobenius 
norm of vector x and matrix X respectively. Other notation 
will be defined when needed. 

II. Preliminaries 

A. Fisher information 

Let the real, non-random parameter vector be denoted by 
e = [6'i,6'2, . . . ,6l„]T. The parameter G 0, where C 
R" is an open set. Let {Pe}ee@ be a family of probability 
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measures for a certain random variable Y taking values in set 
y. Assume that Pq is absolutely continuous with respect to 
a dominating measure fi for each 6 E &. Thus, for each 9 
there exists a density function /(y; 6) = dPg/dfj, for Y. We 
define the expectation Ee[l^] = J y dPg whenever / |y| dPg 
is finite. 

We assume that the family of densities {/y (y; 0)}ee© 
is regular, i.e. satisfying the following three conditions: (1) 
/y (y; 9) is continuous on for /i-almost all y, (2) the log- 
likelihood log fy (y; 6) is mean-square differentiable in 9, and 
(3) V6(log/y(y;0) is mean-square continuous in 6. These 
conditions ensure the existence of the FIM 

3g :=Ee[Velog/y(y;0)][Vjlog/y(y;0)], (1) 

which is an n X n positive semidefinite matrix and is finite. 
With the assumption of the existence, continuity in 9 and 
absolute integrability in Y of the mixed partial differential 
operators {d^ /deid9j)fYiy;9), i,j = l,2,...,n, the FIM 
becomes equivalent to the Hessian of the mean of the curvature 
of log/y(y;0), 3e = -E^V^ log /y (y; 0). 

B. Cramer-Rao Bound 

The importance of the Fisher information is its relation to 
the Cramer-Rao bound (CRB). The CRB is a lower bound 
on the covariance matrix of any unbiased estimator of the 
parameter 9. For any unbiased estimator ^(y) on observations 
y, the relation is given by 

Emy)-9){9{y)~9f]>J^\ (2) 

In principle, it is possible to compute submatrices of the CRB 
by partitioning the Fisher information matrix into blocks, and 
then apply the matrix inversion lemma 15]. As reported in 
||7], methods such as sequential partitioning |9|, Cholesky and 
Gaussian elimination require 0{n^) flops. These methods have 
a high number of flops even if we are concerned with a small 
submatrix, for e.g. the covariance of just to <C n parameters, 
motivating our work. 

III. Formulation and Algorithms 

As a start, we assume a nonsingular Fisher information 
matrix Jg. We now consider the optimization problem 

min ^x^Jgx — hr^x, (3) 

an example of a quadratic program [2|. Let F{x.) 
ix^Jgx — b^x. In this case, the optimization problem is 
strictly convex and possesses a unique optimal. The unique 
optimal solution to this problem is x* = ^b. The general- 
ization of the above is the quadratic matrix program, 

min itr(X'rjeX) -trfB'^X). (4) 

Any feasible solution to (|4]i is a valid lower bound on the 
covariance of the unbiased estimate of 9, and the tightest 
lower bound (the CRB) is the global minimum to dH flTl . 
Matrix B focuses the computation on a submatrix of the CRB. 
The special case B = I„ is equivalent to performing the full 
inverse of Jg, while setting B = enables computation of 



the CRB of just a single 9k, k G {1, 2, • ■ • ,n}. The optimal 
solution to this problem is X* = Jg^B (see Appendix). 
Based on this, if an algorithm searches for the minimum of 
the optimization problems (O or (|4|i, it effectively computes 
b^Jg and B^Jg ^B respectively, essentially computing the 
CRB. 

A. Majorization-Minimization (MM) methods 

The optimization problems above can be solved via the 
majorization-minimization (MM) method, which generalizes 
the Expectation-Maximization (EM) method (see |10| for a 
tutorial). Thus, the algorithm in [7J is a special instance of 
MM. We show that the recursive bounds of Q are just the 
consequence of a special choice. 

We first consider the vector case. Define 

G(x;x(*'-)) := ix^Jgx - b^x + Q(x; x^^'). (5) 

The function (3(x;x^''')) must be chosen so that G(x;x^'^') 
majorizes F{x.). There are two properties to satisfy: (1) 
G(x;x(*=)) > i^(x)foran x, and (2) G(x('=) ; x^) = i^(x('=)). 
These requirements ensure that G(x;x'^'^)) lies above the 
surface of -F'(x) and is tangent at the point x = x'^'^-' ifTol . The 
function G(x;x'''')) is referred to as a surrogate function. By 
these properties, MM-based algorithms converge to the CRB 
monotonically. 

For example, suppose we have a matrix P G §" and P > 
Jg in the positive semidefinite sense, then 

Q(x;xW) :=i(x-xW)T(P-Jg)(x-xW) (6) 

is a popular choice. Minimizing (|5]l with the choice (|6]l w.rt. x 
results in a closed-form solution 

xC^+i) = (I„ - P-Ijg)x('^) + p-^b = xf'^) + p-\h - JgxW). 

(7) 

The above is simply a Jacobi iteration, with a preconditioner 
P |23|. Typically, P is chosen to be diagonal or near diagonal, 
as this facillitates simple computation of P^^. Setting P to the 
Fisher information matrix of the complete data space would 
yield the algorithm in [7]. For the matrix case, G(X; X''''^) :— 
itr^XTjgX - b^x) + Q(X;X('='), we have the choice 

g(X;XW) itr(^(X-XW)T(P- Jg)(X-XW)), to 
obtain a Jacobi iteration. 

The convergence rate for this particular choice is governed 
by the spectral radius p(I„ — P^^ Jg), which is the maximum 
magnitude eigenvalue of the matrix. Exponential convergence 
to the CRB is achieved by ensuring that /9(I„ — P^^Jg) < 1 
is as small as possible. It is for this reason (0(I„ — P^^Jg) 
is also referred to as the root convergence factor ||23l , which 
measures the asymptotic convergence rate. 

The power of MM lies in the great freedom of choice 
when designing g(X;XW). For fast convergence, one needs 
to choose a (5(X;X^'^^) that well-approximates the quadratic 
objective around X^'^'. Second, (5(X;X('^') is chosen in a 
way that it does not depend on quantities we desire, such as 
Jg or is computationally expensive, for instance, a dense P. 
These trade-offs make the algorithm design more of an art 
than science. 
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B. Gradient Descent (GD) methods 

Gradient descent methods rely on minimizing the function 
along particular search directions. At each iteration, two cru- 
cial elements are required: the search direction d''^^, and the 
size of the step cj^'^^. Algorithm [T] presents a generic outline 
of gradient descent methods. 



Algorithm 1 Generic implementation of gradient descent 
Require: e, error threshold 

1: x(") ^ Xinit 

2: while ||i^(xW) - i^(x('^-i))||2 > e do 
3: 0;'''+-^^ <- argmin^^ _F(x('') + ujS'''^) {Exact line 
search} 

5: dC^'+l) ^ A(x('=+1)) 

6: end while 



Gradient methods depend on the evaluation of a function 
A(x) which determines the search direction. For example, in 
classical gradient descent, this is simply gradient of F{x) at 
each iterate, A(x) = — Vx-F'(x) = b — Jgx. 

Exact line searches, however, can be computationally ex- 
pensive. With a fixed choice of oj such that uj < 2/A„(Je), 
the algorithm uses an inexact line search, equivalent to the 
Richardson iteration |23|. The Gauss-Seidel (GS) method 
performs A(x(''^') = -e(fc mod n)+i, « = l,2,...,n EH. 
Conjugate and preconditioned conjugate gradient algorithms 
l|23| also belong to this class, where the search directions 
are constructed by performing a Gram-Schmidt procedure. 
Hence, some of the recursive algorithms presented in [8 1 are all 
instances of gradient descent methods. Unlike the algorithms 
presented in the previous section, these algorithms generally 
have non-monotonic convergence to the CRB. We particularly 
advocate preconditioned conjugate gradient algorithms for 
their fast convergence, requiring only simple line searches, 
and shown to have excellent performance in Section |IV] 

The Newton-Raphson descent method is unusable here, 
since it requires the inverse of the Hessian of the objective 
function, which is 3g^, of which we are avoiding its direct 
computation. For this reason, it is much better to use methods 
with simple line searches with low memory requirements. 

The gradient search method can be applied to quadratic 
matrix programs. Some adaptation is needed, however, such as 
reformulating the problem to a suitable vectorized form (see 

ID). 

C. Extension to Singular Fisher Information 

Suppose now Jg is singular. Such matrices arise in some 
areas, such as blind channel estimation [20^ and positron 
emission tomography |i6J. The properties of singular were 
explored in [[l3l. IITtI. 

The approach taken in ["Sl was to add a perturbation to Je 
in order to make it nonsingular, and then compute its inverse 
via recursive algorithms described above. This approach only 
yields an approximation to the CRB, with increased compu- 
tational complexity. Instead, we take a completely different, 
more efficient, route. 



Assuming b G range(Je), consider the optimization prob- 
lem for the vector case, 

min i||b-Jex||2. (8) 

xGR" 

The optimization problem is convex and the solution is simply 
the minimum norm solution x* ~ b, where Jg denotes the 
Moore-Penrose pseudoinverse [S], which is unique. Thus, we 
can solve for the CRB without having to resort to the approach 
taken by 

The generalization of (O is 

min i||B- JeXIl;., (9) 
xeR" ^ 

assuming the column space of B is in range (Jg). Then, the 
minimum norm solution is X* = JgB. The optimization 
problems here can be solved via the MM or GD methods. 
As an example, using the MM method in the vector case, and 
choosing ^ results in x('^+^^ = x^''"' + P~^Je(b — Jex^^^^). 
A variation where P — vln with v > Ai(j0) is the well- 
known Landweber iteration lfT9l . The technique also applies 
to problem (|9]l. 

D. Extension to Constrained Fisher Information 

Certain constraints provide additional information, resulting 
in the reduction of estimator variance. A direct way of 
deriving the constrained Fisher information is to recompute the 
Fisher information with a new vector parameter 7 such that 
constraints are incorporated in 7. Unfortunately, it generally 
requires a nontrivial alteration of the p.d.f.'s dependence on 
7 instead. Often, the approach is analytically intractable or 
numerically complex. The papers ||4], ifTsl were motivated by 
this problem and provide analytic formulae to compute the 
constrained Fisher information matrix without reparameteri- 
zation. 

In the following, it is enough to assume the unconstrained 
Fisher information matrix e S" . It has been shown that 
inequality constraints do not affect the CRB [4 |, thus, we focus 
entirely on equality constraints. Assume there are p consistent 
and nonredundant equality constraints on 9, i.e. h{9) = 0. 
Let He G R"^^ denote the gradient of constraints h{6). By 
the nonredundancy of the constraints, rank(He) = p. Further- 
more, let \]g G M"^("^P) be a matrix whose column space is 
the orthogonal basis of the cokemel of He, i.e. H^Ue = 0, 
and UjUe = I„_p. 

Assuming that Uq JgUe is nonsingular, it has been shown 
that the CRB is simply ifTll 

X+ =Ue(U^JeUe)-iuT. (10) 

If Je is nonsingular, the above can be rewritten as = 
Jq^ - Jq ^He(HQ Jq ^He)+HQ Jq \ which is equivalent 
to the bound derived in |4|, by choosing Ue = In — 
Jq He(H^jQ He)+H^. 

The algorithms proposed in liT], JU no longer apply here. 
It is also hard to see how the recursive algorithms can be 
extended to account for parameter constraints. It turns out 
constraints can be incorporated into our framework naturally. 
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The solution to the optimization problem (proof in Ap- 
pendix |B]) 

min itr(XTjeX) - tr(BTX) (11) 
subject to H^X = 0. 

is simply 

X* = Ue(UjJeUe)-iUjB = X+B. (12) 

From this, we have computed submatrices of the constrained 
CRB, extending the work in |7|. The general shift to a 
quadratic matrix program instead enables us to consider con- 
straints naturally. If is nonsingular, then equation (fTZt is 
equivalent to 

X* = 3g ^B — Jg ^He(Hg Jg ^He)^Hg Jg ^B, (13) 

in agreement with the above. 

The algorithms discussed previously require some modi- 
fications to account for constraints. MM methods are still 
applicable by ensuring constraints are built into the recursion. 
GD methods such as the preconditioned conjugate gradients 
algorithm can be adapted with constraints (for e.g. ||3])- We 
test some of these methods below. 

IV. Application 

In this section, due to space limitations, we perform nu- 
merical experiments to test the efficiency of the algorithms on 
only one example. The application involves the optimization 
of a data streaming algorithm for the measurement of flows on 
networks, where the parameters are subject to constraints. 

A. Data Streaming Algorithm Optimization 

A flow is a series of packets with a common key, such as 
the source and destination Internet Protocol address. The flow 
size is defined as the number of packets it contains. We are 
interested in the flow size distribution = [61,62, ■■■ ,0n]'^ 
in a measurement interval of T seconds. Each 6k denotes the 
proportion of flows of size k, with n being the largest flow size. 
By definition, J2k=i^k = Ij^fc > 0,Vfc (the strict inequality 
of the latter constraint to ensure no bias issues arise, see II2TI ). 
The gradient of the equality constraint is 1„. 

Data streaming algorithms are used for measuring flows 
on core networks, since the huge volume and speed of 
flows imposes strict memory and processing requirements. 
The advantage of these algorithms is the small amount of 
memory required, at the expense of introducing some error 
when recovering flow traffic statistics. 

The data streaming algorithm we consider is the flow 
sampling-sketcher (FSS). FSS has an array of A counters. 
Every incoming flow is selected with i.i.d. probability p and 
dropped with probability q. Sampling is performed via the use 
of a sampling hash function hs{x) with full range R, which 
acts on a flow key x, configured such that a packet is accepted 
if hs (x) < pR. The deterministic nature of the hash function 
ensures that packets belonging to a sampled flow will be 
always sampled and vice versa. For packets of sampled flows, 
another hash function hc{x) with range A generates an index. 



ensuring that the same counter is incremented by packets from 
the same flow. The counter with the corresponding index is 
incremented once per packet. Note that several flows can be 
mapped to the same counter, resulting in collisions. Once 
the measurement interval is over, the flow size distribution is 
recovered by employing an EM algorithm. FSS is practically 
implementable in routers. The schemes in lfT2l . llT6i are closest 
in spirit to FSS. 

Let Nf be the total number of flows in the measurement 
interval and a' — pNf/A denote the average number of flows 
in a counter Assuming fixed A (i.e. fixed memory allocation), 
the latter has a direct impact on estimation quality, as a' 
controls the flow collisions in the counters. Sampling with low 
p would increase variance due to missing flows, while high p 
results in many flows mapping to the same counter, increasing 
ambiguity due to collisions. Due to the dependence between a 
flow size k on flows smaller than it in each counter, different 
optimal sampling rates minimize the estimator variance for 
each 6k- The objective is to find p^ for a particular target 
flow size k, for e.g. k ~ 1 which is especially important for 
detecting network attacks. 

We use the Poisson approximation to compute the counter 
array load distribution, cg = [co{6),ci{9), ■ ■ ■]'^ . The gen- 
erating function of the load distribution is (|s| < 1 for 
convergence) 

n 

C*{s;0) = []e"'«'=(^'-i) =e-"' ■e^^=i"'«'=^', (14) 

k=l 

essentially a convolution of n weighted Poisson mass functions 
(see 1221 ). The distribution cg is obtained from the coefficients 
of the polynomial expansion of C*{s; 0), easily computed via 
the Fast Fourier Transform (FFT). Examples include cq{9) — 
e-"', ci{0) = a'6ie~"', 02(0) = ia'62 + 2^)e-"'. Let 
We = dia.g{c^^{9),C2^{0), ■ ■ ■)■ The unconstrained Fisher 
information is 

3e{p) - qlnll + pa'GjWeGe, (15) 

where the matrix Gg is a quasi-Toeplitz matrix with generating 
sequence cg, and 3g{p) is positive definite Vp f2l\. The 
matrix is dense and is challenging to compute. Computing the 
constrained Fisher information Xg (p) is even more difficult 
(c.f. ([Toll). Since we only need to compute individual diagonal 
entries of Xq (p), as each fc-th diagonal is the CRB of the 
estimator variance of 6k, and defining 

5(x*,p) =minxeR" 3g{p)x - ejx (16) 

subject to l^x ~ 0, 

we can use, assuming unimodality of g{x*,p) w.rt. p, (see 
Appendix for justification) 

P*k = argmaxpg(04] 9{x*,p). (17) 

For fixed p, since problem (fTSI l is equivalent to (fTTT i , we can 
use efficient optimization algorithms to solve it, avoiding full 
inversion. Problem (fTTI l can be solved by a golden section 
search ifTSl . The generality of our approach allows us to, for 
e.g. compute pfj,^i, the optimal sampling rate that minimizes 
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Fig. 1. Sampling rate p against the square root of CRB of di for the 
distiibution Abilene-III, truncated at n = 2, 000, and a = 4. The optimal 
sampling rate p* = 0.2342, denoted by the dot on the curve. 

the joint variance of 6^ to 6^, achieved by replacing with 
a matrix B which is zero everywhere except for the matrix 
If-fc+i at the l-lh position, and use (fTTI) in place of problem 
dlSll. 

B. Numerical Results 

Our focus is the computation of (fTSI l using the various 
methods discussed earlier We tested the algorithms on the 
important case of k = 1. The distribution used is a truncated 
version of the Abilene-III |[l4l, truncated to n = 2,000 
packets to the satisfy the parameter constraints. Here, a — A, 
p\ = 0.2342 and tolerance for all iterative algorithms is within 
e = 10-6 of the true CRB. For reference, [Xj(p|)]ii ^ 
1.67005. While it is unknown if the sampling rate-CRB curve 
is strictly convex for all 9k, in this case, it is (see Figureflla)). 
We omit dependence on p in the following since p — p*. 

In practice, cg is truncated up to a sufficiently large number 
of terms K and computed using Fast Fourier transforms. In 
what follows, we assume cg has been computed. Define Jg^M 
to be the Fisher information computed with cg up to M terms. 
Then, K is chosen as the value when HJe.if — Je.K-illF < S, 
i.e. a preset tolerance (5 > 0. With the value of K terms, it takes 
n?{K + 1) flops to construct 3g. In our case, K = 10,000. 
We assume that Jg is computed and stored upfront for all 
methods. In our case, this is cheaper than recomputing vector- 
matrix products Jex, due to the complexity of Gg, at the 
expense of higher memory storage. 

If we perform full inversion, it takes flops via 

Cholesky methods, followed by flops to construct the 
term Jg ^l„(l^Jg ^l„)^^l^Jg ^. Thus, it takes a total of 
+ in^ flops, and with n = 2, 000, it takes 2.68 Gflops. 
In contrast, for recursive methods, each iteration requires 
[n + 1)^ flops, with the additional requirement to account 
for constraints. Depending on the method, additional oper- 
ations might be required such as computing the diagonal 
preconditioner, which would require about 9n flops (see [8J). 
Generally, O(n^) flops per iteration are required for the 
following methods. 



We compare two classes: Constrained Majorization- 
Minimization (CMM) and Constrained Preconditioned Conju- 
gate Gradient (CPCG). For CMM, we have CMM-CF where 
the Fisher information of the complete data space, Je = 
Q!diag(f?j^^, (?2^^, • ■ • ,9^^) was used as the preconditioning 
matrix. CMM-DD instead uses the first order diagonally 
dominant matrix of Je (see ID for more details). The recursion 
step for CMM was derived using Lagrangian multipliers to 
account for constraints when minimizing (|5). For CPCG, the 
preconditioner matrix used is 3g. 

We also tested Gradient Projection (GP), which is the stan- 
dard GD algorithm using 3g as preconditioner, but accounts 
for constraints and uses exact line searches. GP, however, 
diverged for all iterations. Even without a preconditioner, the 
results remain the same. We omit its results and explain its 
poor performance later on. 

All algorithms were initialized with the same initial point. 
The breakeven threshold, i.e. the number of iterations before 
all methods would lose computational advantage to a direct 
evaluation of the CRB is 667 iterations. Table H] presents the 
comparison between different algorithms. The second column 
lists the root convergence factor of each algorithm. The root 
convergence factor, p is defined differently for each method. 
For CMM, refer to Section |III]For CPCG, it is p = 

where k = '^^^'^e ^•'^^ ||23l . The third and fourth columns 

A„(Jg Je) 

lists the number of iterations required for the result from the 
algorithms to be 5% and 0.5% respectively, tolerance of the 
CRB. The fifth column denotes the number of iterations before 
the algorithm reaches within tolerance e of the CRB. 

While all CMM algorithms converge mono tonic ally to the 
bound, they are extremely slow. The monotonic convergence 
of both algorithms can be seen in Figure |2] Clearly, CMM-CF 
has a faster convergence rate compared to CMM-DD. CPCG 
performs the best; however, its iterates have non-monotonic 
convergence. CMM and GP perform badly due to the small 
condition number of Jg, which is 2.73 x 10^. In particular, for 
GP, the projected descent steps move in a circular trajectory. 
Note that for all methods, the root convergence factor is a 
good predictor of the total number of iterations needed for 
convergence, but is not predictive of the number of iterations 
needed to be within 5% and 0.5% of the bound. Furthermore, 
only CPCG possesses some robustness with respect to the 
selection of the initial point. The other algorithms have a 
strong dependence on the initial point, and may have poor 
performance with a bad initial point choice. Finally, CPCG 
is the only algorithm that converges within the breakeven 
threshold. 

We also compare CPCG against a straightforward way of 
evaluation using the GS method. We use GS to evaluate two 
quantities separately: Jg^ei and J^^l„, and then use ( fTsl ) 
to evaluate x*. The trajectory of this method was compared 
with the trajectory of CPCG in Figure |3] Initialization for 
GS requires two initial points for evaluation of the two 
quantities. To ensure fairness, CPCG was initialized using the 
first evaluated point of GS. GS reaches to within 5% and 
0.5% of the bound in 8 and 10 iterations, and requires 110 
iterations for convergence. In contrast, CPCG takes 5, 6 and 
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Alg. 


P 


5% 


0.5% 


Convergence 


CMM-DD 


0.9998 


12,935 


23,549 


69,026 


CMM-CF 


0.9986 


161 


407 


8,487 


CPCG 


0.8715 


5 


7 


48 



TABLE I 

Asymptotic and finite convergence properties of the iterative 
algorithms 



64 iterations for 5% and 0.5% of the bound, and convergence 
respectively. The reason for the slow convergence of GS is due 
to the oscillations occurring near the end, as this method does 
not perform a methodical search across the constrained space, 
unlike CPCG. Clearly, CPCG is far superior to this method. As 
seen in Figure [3] both methods converge non-monotonically 
to the true CRB. 
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Fig. 2. Trajectory of CMM-CF and CMM-DD when computing the square 
root of CRB of Q\ for the distribution Abilene-111, truncated at n = 2, 000, 
and « = 4, shown here up to 500 iterations. Tolerance e = 10^^. CMM-CF 
converged in 8,487 iterations, while CMM-DD converged in 69,026 iterations. 
Note the monotonic convergence of both algorithms to the true CRB. 



V. Conclusion 

In this paper, we revisit the problem of computing subma- 
trices of the CRB. We show that computation of these sub- 
matrices are related to a quadratic matrix program. Due to the 
properties of the FIM and the convexity of the quadratic matrix 
program, we can compute the submatrices with efficient algo- 
rithms from convex optimization literature. We further show 
how the framework here easily extends to the case when the 
FIM is singular, and when parameter constraints are present. 
We then apply the algorithms on a constrained optimization 
problem, showing that the computation of these bounds can be 
evaluated efficiently for important signal processing problems. 
Future work includes exploring more algorithms for evaluation 
that may possess faster convergence rates and testing on other 
constrained problems. 

Appendix 

A. Derivation of the Optimal Solution of ^ 

The derivation relies on the relations Vxtr(X^JeX) = 
2J6(X and Vxtr(B'^X) = B. Using these relations, the 
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Fig. 3. Trajectory of the GS and CPCG when computing the square root 
of CRB of Q\ for the distribution Abilene-Ill, truncated at n = 2, 000, and 
a = 4. Tolerance e = lO"*". GS converged in 110 iterations, while CPCG 
converged in 64 iterations. 



gradient of the problem is JgX — B. Setting this to 0, we 
obtain the optimal solution. 

B. Derivation of ( 1121 ) 

Since G S", the objective is a convex function, and 
the constraints are linear, thus, the optimization problem re- 
mains convex. By using Lagrangian multipliers Z, the optimal 
solution obeys J^X - B - H^Z ^ and H^X = 0. 
This implies that the feasible solutions of X has the structure 
X = UeY, where Y e ]R(»-p)x'«, as solutions must lie in 
the cokernel of He and the range space of Ug. Proceeding 
in this fashion, we obtain JgUeY = B. Multiplying by 
on both sides, we then get Y* = (U^JeUe)-iUjB. 
Multiplying Y* again by Ue, we obtain the optimal solution 
X*. In the case of nonsingular Je, one can choose Ue = 
1)1 — ^He(Hg Jg ^He)+Hg as it lies in the cokemel of He 
and is orthogonal (see details in |i4J). Then, X* is equivalent 
to O- 



C. Formulation of ( 1171 ) 

Consider the objective function once the inner mini- 
mization problem is solved. The objective function yields 

— ^[Xg {p')]kk, for some particular rate sampling rate p'. Now, 
p^ is the optimal value if and only if —^[Xg{p*f,)]kk > 

— ^[Xg{p')]kk for all other p' ^ p^. By maximizing the 
objective function over p, we solve for pj. 

D. Derivation of the Constrained MM 

We prove the result for the vector case. Similar derivation 
applies for the matrix case. As discussed in Section |III] we 
use Q(x;x('=)) i(x-xW)T(P- Je)(x-xW). Then, the 
task is to minimize 



1 



:= ix^Jex 
2 



G(x;x 

subject to the constraint HgX 



Q(x; x' 



(fc)^ 



7 



Using the method of Lagrangian muhiphers |l2], we con- 
struct the Lagrangian, with multipliers /x G IR^*, 

L(x,/x;x(^-)) = G(x;xW)+/xThTx. (18) 
At the optimal point, there are two equations to satisfy: 
Vx2v(x, x^'^) ) = Px + (P - Je)x('=) - b + He/x = 0, 
V^L(x,/x;xW) =HTx = 0. 
Solving both equations, we have 

= (H^p-iH,)-iH^((I„ - P-Ij,)x(^-) - P-Ib) , 
which exists, since we choose P G Finally, 

x(^-+i) = (I„ - p-iJe)x(^) - p-^b - HeM^'+'^ 
= Te((I„-P-%)xW -P-ib). 

where Te = I„ - He(HgP"^H6))"^Hg is a projection 
operator Note its similarity to the basic Jacobi iteration, 
except with the projection Tg to account for the parameter 
constraints. 

References 

[1] A. Beck. Quadratic matrix programming. SI AM J. Optim., 17(4): 1224- 
1238, 2007. 

[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge 
University Press, 2004. 

[3] T. F. Coleman. Linearly constrained optimization and projected pre- 
conditioned conjugate gradients. In Proc. 5th SIAM Conf. on App. Lin. 
Algebra, pages 118-122, 1994. 

[4] J. D. Gorman and A. O. Hero. Lower bounds for parametric estimation 
with constraints. IEEE Trans. Info. Theory, 36(6):1285-1301, November 
1990. 

[5] D. Harville. Matrix Algebra from a Statistician's Perspective. Springer- 
Verlag, 1997. 

[6] A. Hero, J. Fessler, and M. Usman. Exploring estimator bias-vaiiance 
tradeolfs using the Uniform CR bound. IEEE Trans. Sig. Proc, 
44(8):2026-2041, August 1996. 

[7] A. O. Hero and J. Fessler. A recursive algorithm for computing Cramer- 
Rao-type bounds on estimator covariance. IEEE Trans. Info. Theory, 
40(4):1205-1210, July 1994. 

[8] A. O. Hero, M. Usman, A. C. Sauve, and J. Fessler. Recursive 
algorithms for computing the Cramer-Rao bound. Technical Report 
305, Communication and Signal Processing Laboratory, University of 
Michigan, Ann Arbor, November 1996. 

[9] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University 
Press, 1985. 

[10] D. R. Hunter and K. Lange. A tutorial on MM algoiithms. The American 

Statistician, 58(l):30-37, February 1996. 
[11] S. M. Kay. Fundamentals of Statistical Signal Processing, Volume I: 

Estimation Theory. Prentice Hall PTR, March 1993. 
[12] A. Kumar, M. Sung, J. Xu, and J. Wang. Data streaming algorithms for 

efficient and accurate estimation of flow size distribution. In Proc. of 

ACM SIGMETRICS 2004, New York, June 2004. 
[13] R. C. Liu and L. D. Brown. Nonexistence of informative unbiased 

estimators in singular problems. Ann. Stat., 21(1), 1993. 
[14] NLANR. Abilene-lII Trace Data. http://pma.nlam-.net/Special/ipls3.html. 
[15] W. H. Press, S. A. Teukolsky, W. T. Vetteriing, and B. P Flannery. 

Numerical Recipes: The Art of Scientific Computing. Cambridge 

University Press, 3rd edition, 2007. 
[16] B. Ribeiro, T. Ye, and D. Towsley. A resource minimalist flow size his- 
togram estimator. In Proc. 2008 ACM SIGCOMM Internet Measurement 

Conference, pages 285-290, Vouliagmeni, Greece, October 2008. 
[17] P. Stoica and T. Marzetta. Parameter estimation problems with singular 

information matrices. IEEE Trans. Sig. Proc, 49(1), January 2001. 
[18] P. Stoica and B. C. Ng. On the Cramer-Rao bound under paramehic 

constraints. IEEE Signal Processing Letters, 5(7): 177-179, July 1998. 
[19] O. N. Strand. Theory and methods related to the singular-function 

expansion and Landweber's iteration for integral equations of the first 

kind. SIAM I Numer Anal, 11:798-825, September 1974. 



[20] J. R. Treichler. Special issue on: Blind system identification and 

estimation. Proc IEEE, 86, October 1998. 
[21] P. Tune and D. Veitch. Fisher information in flow size distribution 

estimation. IEEE Trans. Info. Theory, 57(10):701 1-7035, October 2011. 
[22] P. Tune and D. Veitch. Samphng vs Sketching: An Information Theoretic 

Comparison. In IEEE Infocom 2011, pages 2105-21 13, Shanghai, China, 

April 10-15 2011. 

[23] D. S. Watkins. Fundamentals of Matrix Computations. Wiley- 
Interscience, 2nd edition, 2002. 



